February 18, 2025
\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]
Backpropagation is the main workhorse for training DNNs
Construct the computational graph for the NN
Create initial values for the unknowns
Compute the intermediates as a function of the initial values
Compute the gradient using backprop
Update the unknowns using SGD.
Repeat a bunch of times
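The loop above can be sketched for the simplest possible "network": \(z = wx + b\) with MSE loss. This is a minimal illustration, not the notes' actual model; the data and learning rate are made up.

```python
# Minimal sketch of the training loop: forward pass, backprop, SGD update, repeat.
def train(xs, ys, lr=0.05, epochs=200):
    w, b = 0.0, 0.0                  # initial values for the unknowns
    for _ in range(epochs):          # repeat a bunch of times
        for x, y in zip(xs, ys):
            z = w * x + b            # forward pass: compute the intermediates
            dz = -2.0 * (y - z)      # backprop: dL/dz for L = (y - z)^2
            w -= lr * dz * x         # SGD update: dL/dw = dz * x
            b -= lr * dz             # SGD update: dL/db = dz
    return w, b

w, b = train([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])  # data from y = 2x + 1
```

With enough passes, \((w, b)\) recovers the generating line.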
For DNNs, there are patterns in the forward and backward passes that allow us to succinctly write code to compute the complicated gradients!
Quick review:
\(\mathbf y\) is a \(N\)-vector of continuous outcomes
\(\mathbf X\) is a \(N \times P\) matrix of feature values
Loss is MSE
\[\mathcal L(\boldsymbol \theta | \mathbf X, \mathbf y) = \expn (y_i - z_i)^2 = \frac{1}{N} (\mathbf y - \mathbf z)^T(\mathbf y - \mathbf z)\]
2 layer FCNN
Loss layer weights (a \(K_2\) vector)
2nd layer weights (a \(K_1 \times K_2\) matrix)
1st layer weights (a \(P \times K_1\) matrix)
Loss layer:
Loss layer (for a single instance):
\[\frac{\partial \mathcal L}{\partial \mathbf z} = \underset{(1 \times 1)}{-2(y - z)}\]
Output layer:
Output layer (for one instance):
\[\underset{(1 \times 1)}{z} = \underset{(1 \times K_2)}{\bb^T}\underset{(K_2 \times 1)}{\mathbf h_2} + \underset{(1 \times 1)}{a}\]
\[\frac{\partial z}{\partial a} = \underset{(1 \times 1)}{1}\]
\[\frac{\partial z}{\partial \bb} = \underset{(1 \times K_2)}{\mathbf h_2}\]
\[\frac{\partial z}{\partial \mathbf h_2} = \underset{(1 \times K_2)}{\bb^T}\]
Loss gradients:
\[\underset{(1 \times 1)}{z} = \underset{(1 \times K_2)}{\bb^T}\underset{(K_2 \times 1)}{\mathbf h_2} + \underset{(1 \times 1)}{a}\]
\[\frac{\partial \mathcal L_i}{\partial a} = \underset{(1 \times 1)}{-2(y - z)}\]
\[\frac{\partial \mathcal L_i}{\partial \bb} = \underset{(1 \times 1)}{-2(y - z)} \otimes \underset{(1 \times K_2)}{\mathbf h_2}\]
\[\frac{\partial \mathcal L_i}{\partial \mathbf h_2} = \underset{(1 \times 1)}{-2(y - z)} \otimes \underset{(1 \times K_2)}{\bb^T}\]
Hidden Layer 2:
Hidden Layer 2 (for one instance):
\[\underset{(K_2 \times 1)}{\mathbf h_2} = \varphi(\mathbf q_2)\]
\[\frac{\partial \mathbf h_2}{\partial \mathbf q_2} = I(\mathbf q_2 > 0)\mathcal I_{K_2}\]
\[\underset{(K_2 \times 1)}{\mathbf q_2} = \underset{(K_2 \times K_1)}{\mathbf W_2^T} \underset{(K_1 \times 1)}{\mathbf h_1} + \underset{(K_2 \times 1)}{\mathbf b_2}\]
\[\frac{\partial \mathbf q_2}{\partial \mathbf b_2} = \underset{(K_2 \times K_2)}{\mathcal I_{K_2}}\]
\[\frac{\partial \mathbf q_2}{\partial \mathbf W_2} = \text{Complex}\]
\[\frac{\partial \mathbf q_2}{\partial \mathbf h_1} = \mathbf W_2^T\]
Loss gradients:
\[\frac{\partial \mathcal L_i}{\partial \mathbf b_2} = \underset{(1 \times 1)}{-2(y - z)} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_2 > 0)}\]
\[\frac{\partial \mathcal L_i}{\partial \mathbf h_1} = \underset{(1 \times 1)}{-2(y - z)} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_2 > 0)} \otimes \underset{(K_2 \times K_1)}{\mathbf W_2^T}\]
\[\frac{\partial \mathcal L_i}{\partial \mathbf W_2^T} = \underset{(K_2 \times 1)}{\mathbf u_2^T} \underset{(1 \times K_1)}{\mathbf h_1}\]
where \(\mathbf u_2\) is the upstream gradient for the second hidden layer:
\[\mathbf u_2 = \frac{\partial \mathcal L_i}{\partial \mathbf q_2} = \underset{(1 \times 1)}{-2(y - z)} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_2 > 0)}\]
Hidden Layer 1:
Hidden Layer 1 (for one instance):
\[\underset{(K_1 \times 1)}{\mathbf h_1} = \varphi(\mathbf q_1)\]
\[\frac{\partial \mathbf h_1}{\partial \mathbf q_1} = I(\mathbf q_1 > 0)\mathcal I_{K_1}\]
\[\underset{(K_1 \times 1)}{\mathbf q_1} = \underset{(K_1 \times P)}{\mathbf W_1^T} \underset{(P \times 1)}{\mathbf x} + \underset{(K_1 \times 1)}{\mathbf b_1}\]
\[\frac{\partial \mathbf q_1}{\partial \mathbf b_1} = \underset{(K_1 \times K_1)}{\mathcal I_{K_1}}\]
\[\frac{\partial \mathbf q_1}{\partial \mathbf W_1} = \text{Complex}\]
\[\frac{\partial \mathbf q_1}{\partial \mathbf x} = \mathbf W_1^T\]
Loss gradients:
\[\frac{\partial \mathcal L_i}{\partial \mathbf b_1} = \underset{(1 \times 1)}{-2(y - z)} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_2 > 0)} \otimes \underset{(K_2 \times K_1)}{\mathbf W_2^T} \odot \underset{(1 \times K_1)}{I(\mathbf q_1 > 0)}\]
\[\frac{\partial \mathcal L_i}{\partial \mathbf x} = \underset{(1 \times 1)}{-2(y - z)} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_2 > 0)} \otimes \underset{(K_2 \times K_1)}{\mathbf W_2^T} \odot \underset{(1 \times K_1)}{I(\mathbf q_1 > 0)} \otimes \underset{(K_1 \times P)}{\mathbf W_1^T}\]
\[\frac{\partial \mathcal L_i}{\partial \mathbf W_1^T} = \underset{(K_1 \times 1)}{\mathbf u_1^T} \underset{(1 \times P)}{\mathbf x}\]
where \(\mathbf u_1\) is the upstream gradient for the first hidden layer:
\[\mathbf u_1 = \frac{\partial \mathcal L_i}{\partial \mathbf q_1} = \underset{(1 \times 1)}{-2(y - z)} \otimes \underset{(1 \times K_2)}{\bb^T} \odot \underset{(1 \times K_2)}{I(\mathbf q_2 > 0)} \otimes \underset{(K_2 \times K_1)}{\mathbf W_2^T} \odot \underset{(1 \times K_1)}{I(\mathbf q_1 > 0)}\]
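The chain-rule expressions above are easy to verify numerically. This is a plain-Python sketch with made-up tiny dimensions (\(P = K_1 = K_2 = 2\)) and made-up weights; it checks the closed form \(\partial \mathcal L_i / \partial \mathbf b_2 = \mathbf u_2\) against finite differences.

```python
def relu_vec(v):
    return [max(0.0, a) for a in v]

def matvec_t(W, v):
    # (W^T v)_j for W stored as a list of rows
    return [sum(W[i][j] * v[i] for i in range(len(W))) for j in range(len(W[0]))]

def loss_and_q2(x, y, W1, b1, W2, b2, beta, a):
    q1 = [s + t for s, t in zip(matvec_t(W1, x), b1)]   # q1 = W1^T x + b1
    h1 = relu_vec(q1)
    q2 = [s + t for s, t in zip(matvec_t(W2, h1), b2)]  # q2 = W2^T h1 + b2
    h2 = relu_vec(q2)
    z = sum(bk * hk for bk, hk in zip(beta, h2)) + a    # z = beta^T h2 + a
    return (y - z) ** 2, q2, z                          # per-instance MSE loss

# Made-up instance and weights
x, y = [1.0, -0.5], 2.0
W1 = [[0.3, -0.2], [0.1, 0.4]]; b1 = [0.05, -0.1]
W2 = [[0.5, 0.2], [-0.3, 0.6]]; b2 = [0.1, 0.2]
beta, a = [0.7, -0.4], 0.05

L0, q2, z = loss_and_q2(x, y, W1, b1, W2, b2, beta, a)
# Analytic upstream gradient: u2 = -2(y - z) x beta^T (.) I(q2 > 0) = dL/db2
u2 = [-2.0 * (y - z) * beta[k] * (1.0 if q2[k] > 0 else 0.0) for k in range(2)]

# Finite-difference version of dL/db2
eps = 1e-6
numeric = []
for k in range(2):
    b2_bump = list(b2)
    b2_bump[k] += eps
    L_bump, _, _ = loss_and_q2(x, y, W1, b1, W2, b2_bump, beta, a)
    numeric.append((L_bump - L0) / eps)
```

The two gradients agree to several decimal places, which is exactly the kind of check autodiff frameworks run in their test suites.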
For a generic linear layer, we always have the same gradient form!
The gradient at any layer is dependent on upstream layers through \(\mathbf u\):
\[\mathbf u = \frac{\partial \mathcal L}{\partial z} \otimes \bb^T \odot \varphi'(\mathbf q_D) \otimes \mathbf W_D^T \odot \varphi'(\mathbf q_{D - 1}) \otimes \mathbf W_{D - 1}^T \odot \varphi'(\mathbf q_{D - 2}) \otimes \mathbf W_{D - 2}^T \cdots\]
where \(\varphi'()\) is the derivative of the activation function
As discussed last class, it is possible for this product chain to explode or vanish
Multiplying a lot of large values together approaches \(\infty\)
Multiplying a lot of small values (close to zero) approaches 0
These are known as exploding gradients or vanishing gradients
The exploding gradient problem can be addressed with gradient clipping or carefully setting starting values using Glorot initialization.
Vanishing gradients are harder to detect and deal with!
Why do you think vanishing gradients are a problem?
Think about training a DNN with an SGD optimizer.
One reason we might see a vanishing gradient is due to the choice of activation function.
An old choice of activation function was the sigmoid
\[\underset{(K_1 \times 1)}{\mathbf h_1} = \varphi(\mathbf q_1)\]
\[\varphi(\mathbf q_1) = \sigma(q_1) = \frac{1}{1 + \exp[-q_1]}\]
The sigmoid was used a lot because it has a simple continuous derivative:
\[\varphi'(q_1) = \sigma(q_1)(1 - \sigma(q_1))\]
However, this activation function saturates and leads to gradient problems.
At any layer, the gradient for the weights and the hidden units using the sigmoid activation function:
\[\frac{\partial \mathcal L}{\partial \mathbf h_{d - 1}} = \mathbf u \odot \sigma(\mathbf q_d)(1 - \sigma(\mathbf q_d)) \otimes \mathbf W_d^T\]
\[\frac{\partial \mathcal L}{\partial \mathbf W_d^T} = (\mathbf u \odot \sigma(\mathbf q_d)(1 - \sigma(\mathbf q_d)))^T \otimes \mathbf h_{d-1}\]
If any \(q_d\) is sufficiently large (>3ish) or small (<-3ish), the derivative of the sigmoid activation function will essentially be zero!
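The ">3ish" claim is easy to check directly; a quick sketch:

```python
# Checking the saturation claim: sigma'(q) = sigma(q)(1 - sigma(q)) peaks
# at 0.25 (at q = 0) and collapses toward zero once |q| passes 3 or so.
import math

def sigmoid(q):
    return 1.0 / (1.0 + math.exp(-q))

def sigmoid_deriv(q):
    s = sigmoid(q)
    return s * (1.0 - s)

grads = {q: sigmoid_deriv(q) for q in (0.0, 3.0, 6.0)}
# sigma'(0) = 0.25; sigma'(3) is about 0.045; sigma'(6) is about 0.0025
```

Chaining even a few such factors through the product in \(\mathbf u\) drives the gradient toward zero.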
How can we fix this problem?
Prevent hidden values from getting too big or too small (normalization)
Modify the activation function so that it doesn’t saturate (rectifiers)
Choose good initial values (Glorot initialization)
Modify the DNN architecture such that the updates are additive instead of multiplicative (residual connections; 13.4.4 in PML1; More with CNNs)
Non-saturating activation functions:
An activation function where the derivative doesn’t saturate at zero over the real line.
The classic example is ReLU:
\[\varphi(q_1) = \max(0,q_1)\]
With the simple derivative of 0 if \(q\) is less than 0 and 1 if it’s greater.
A first thought:
ReLU saturates on the negative side!
You’re right! And this can be problematic. But, the benefits of ReLU outweigh the negatives.
The big benefit of these piecewise activation functions is that they induce sparsity in the hidden units
Each observation only has a few non-zero hidden values across the different hidden units
Allows each hidden unit to correspond to a sparse set of hidden features
Activation functions that are everywhere nonzero (like the sigmoid) don’t have this explicit property
While other approaches to solving the vanishing gradient problem are more popular, we could always come up with variants to the ReLU activation that don’t completely zero out units that are less than zero!
The cost is specificity of the hidden units to coherent subfunctions of the original feature space.
Leaky ReLU
\[\varphi(\mathbf q_1) = \max(\alpha q_1, q_1)\]
where \(0 < \alpha < 1\)
ELU
\[\varphi(\mathbf q_1) = \begin{cases} q_1 & \text{if } q_1 > 0 \\ \alpha(e^{q_1} - 1) & \text{if } q_1 \leq 0 \end{cases}\]
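The three rectifiers above can be sketched in a few lines; the \(\alpha\) defaults (0.1 for Leaky ReLU, 1.0 for ELU) are illustrative choices within the notes' \(0 < \alpha\) constraint, not prescribed values.

```python
import math

def relu(q):
    return max(0.0, q)

def leaky_relu(q, alpha=0.1):
    return max(alpha * q, q)            # slope alpha instead of 0 for q < 0

def elu(q, alpha=1.0):
    return q if q > 0 else alpha * (math.exp(q) - 1.0)  # smooth negative tail

# Unlike ReLU, neither variant maps negative inputs exactly to zero,
# so the gradient never dies completely on the negative side.
```

The trade-off the notes describe is visible here: ReLU returns exact zeros (sparsity), while the variants return small nonzero values (gradient flow).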
Changing the activation function can help!
It just loses some of the clarity of the learned hidden features.
ReLU promotes sparsity which can lead to less complex models!
Not having a rectifier leads the hidden units to not really correspond with meaningful parts of the image!
This is one of the hallmarks of DNNs!
Even though ReLU is better than the sigmoid for the positive parts of the hidden units in terms of vanishing gradients, it still has problems when a lot of values are placed below zero
Large negative weights
Bad starting values
When the gradient at a specific hidden unit goes to zero for good (e.g. the weights move highly negative and lead everything to get mapped to zero by ReLU), this is called the dead ReLU problem
Batch Normalization
The problem of vanishing gradients occurs when most of the hidden values coming out of a node are mapped exactly to zero through the activation layer.
This can be problematic: the NN gets trapped in suboptimal locations along the gradient descent path and cannot escape.
If a hidden unit is actually supposed to be mostly zero, it will get there pretty quickly once we’re around the minimum loss coefficients
Get everything close to where it needs to be before we get to the minimum of the loss space and then adjust to zero when we get there
The advantage is that we don’t have situations where the gradient rushes to zero before we hit the minimum
This can be achieved with minimal additional computation and change to the training procedure through batch normalization of the hidden values!
Over a batch of \(B\) instances being considered at iteration \(t\) of the SGD optimizer, the goal is to ensure that each set of \(B\) hidden units - \(\mathbf q_d\), a \(B \times K_d\) matrix of values - has columns that have mean zero and variance 1.
Given a column of hidden values (pre-activation):
\[\tilde{\mathbf q}_k = \gamma \odot \hat{\mathbf q}_k + \delta\]
\[\hat{\mathbf q}_k = \frac{\mathbf q_k - \mu_k}{\sqrt{\sigma^2_k + \epsilon}}\]
\[\mu_k = \frac{1}{B} \sum \limits_{j = 1}^{B} q_{j,k}\]
\[\sigma^2_k = \frac{1}{B} \sum \limits_{j = 1}^{B} (q_{j,k} - \mu_k)^2\]
In words, we’re standardizing each hidden unit!
This prevents the weights from ever getting too big or too small.
And prevents large negatives or positives from occurring!
This prevents vanishing and exploding gradients!
The cost is that early stopping can lead to unclear activations
However, when we converge to a minimum, we expect that the activations will creep very close to zero
This method can easily be implemented using PyTorch
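In PyTorch this is `torch.nn.BatchNorm1d` (applied per column, plus running statistics for use at test time). A plain-Python sketch of the per-column computation above, with made-up batch values and \(\gamma = 1\), \(\delta = 0\):

```python
# Batch-normalize one column q_k of hidden values over a batch of B instances,
# following the equations above.
def batch_norm_column(q_k, gamma=1.0, delta=0.0, eps=1e-5):
    B = len(q_k)
    mu = sum(q_k) / B                                   # batch mean
    sigma2 = sum((q - mu) ** 2 for q in q_k) / B        # batch variance
    q_hat = [(q - mu) / (sigma2 + eps) ** 0.5 for q in q_k]
    return [gamma * q + delta for q in q_hat]           # learnable rescale/shift

out = batch_norm_column([4.0, 6.0, 8.0, 10.0])
```

The output column has mean zero and variance (essentially) one, regardless of how large or small the incoming hidden values were.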
Up until this point, we haven’t really discussed the problem of overfitting with DNNs
We know that NNs are universal approximators
This is not desirable if our goal is to create a predictive model!
The longer we train the NN, the more we overfit to the training data
We know that the “tuning parameters” for the NN are the number of hidden units in each layer
As the number increases, the more likely we are to overfit to the data
Unless we have insanely large sample sizes
Maybe we could try lots of different values for the number of units?
Training NNs is pretty quick.
But still computationally expensive.
Additional problem: fast fitting
We move too quickly towards a minimum
Matters when better minima exist that require small moves to find
Beeline vs. Slow Stroll
Really small learning rate?
Instead, different methods are used for a specific NN specification
Train a 512 hidden unit NN
Use different regularization methods to try to prevent overfitting to the training data
Method 1: Smaller Batch Sizes
Let stochastic gradient descent do the work
Can y’all come up with some reasons why this would work?
Can y’all come up with some reasons why this isn’t always advisable?
Method 2: Early stopping criteria related to a validation set
Split the original training data into training and validation sets
Monitor the loss w.r.t. the training and validation data at the end of each epoch
Stop when the validation loss increases!
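The monitoring rule above can be sketched as a simple function. This is a deliberately bare version: real implementations usually add a "patience" window before stopping, which is omitted here.

```python
# Given per-epoch validation losses, stop the first time the loss goes up
# and keep the weights from the epoch before the increase.
def early_stop_epoch(val_losses):
    for epoch in range(1, len(val_losses)):
        if val_losses[epoch] > val_losses[epoch - 1]:
            return epoch - 1          # index of the epoch whose weights we keep
    return len(val_losses) - 1        # never increased: keep the last epoch

best = early_stop_epoch([1.00, 0.80, 0.70, 0.75, 0.90])
```

Here training would halt after epoch 3, keeping the epoch-2 weights.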
A common sense approach
We know that we’ll improve the fit to the validation data well at the beginning
At a certain point, we learn the underlying functional form as well as we can
Every “improvement” after that is just responding to noise!
This approach should be used, but comes with a few cautions.
Early stopping using a split of training and validation data is a must for fitting DNNs!
Early stopping is just a heuristic method, though.
Maybe we can use methods to try to explicitly do this.
Consider a two-layer neural network with 512 hidden units in each layer and 10 categories.
Let’s examine the number of coefficients:
First layer: \(512 \times P\)
Second layer: \(512 \times 512\)
Loss layer: \(512 \times 10\)
That is 668,672 coefficients!
Even for the MNIST data with 20,000 training observations, that’s too many parameters!
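The count is quick to verify, assuming \(P = 784\) inputs (the MNIST pixel count) and ignoring the bias terms:

```python
P, K1, K2, C = 784, 512, 512, 10   # inputs, hidden units x2, categories
first = K1 * P                     # first layer:  512 x P
second = K1 * K2                   # second layer: 512 x 512
loss_layer = K2 * C                # loss layer:   512 x 10
total = first + second + loss_layer
```

That total is far more coefficients than we have training observations.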
Most of these parameters can’t actually do anything to help the model!
The model is overparameterized
However, the model seems to do okay even with this incredibly overparameterized space
This is a function of SGD and its inherent regularization properties!
Rather than fitting an overparameterized model and stopping it early, we can also try to regularize the model space to encourage most weights to be close to zero.
For a deep neural network with D layers, we have \(D + 1\) sets of weights on the units:
\[\bb, \mathbf W_D, \mathbf W_{D - 1},...,\mathbf W_{1}\]
For each set of weights, define the squared Frobenius norm of the weight matrix as:
\[\|\mathbf W_d\|^2_F = \sum \limits_{i = 1}^{K_{d-1}} \sum \limits_{j = 1}^{K_{d}} w_{ij}^2\]
Fitting a NN with weight decay amounts to fitting a NN with L2 regularization of the weights:
\[ \frac{1}{N} \sum \limits_{i = 1}^N \mathcal L(\boldsymbol \theta | \mathbf x_i , y_i) + \lambda \sum \limits_{d = 1}^D \|\mathbf W_d\|^2_F\]
Each set of weights is penalized in the original loss specification
Like setting a Ridge penalty for the entire neural network!
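The penalty term is just a sum of squared entries per weight matrix, scaled by \(\lambda\); a small sketch with made-up weights:

```python
# Squared Frobenius norm of one weight matrix: sum of squared entries.
def frob_sq(W):
    return sum(w * w for row in W for w in row)

# Weight-decay penalty: lambda times the sum of squared Frobenius norms.
def weight_decay_penalty(weight_matrices, lam):
    return lam * sum(frob_sq(W) for W in weight_matrices)

penalty = weight_decay_penalty([[[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0]]], lam=0.01)
# 0.01 * ((1 + 4 + 9 + 16) + (1 + 1)) = 0.01 * 32 = 0.32
```

In practice this is what the `weight_decay` argument of SGD-style optimizers adds to the gradient for you, rather than being computed explicitly in the loss.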
Will still eventually get to zero training error (plus a constant from the penalty term)
Helps to slow the process down to hopefully find better generalizable solutions!
Weight decay helps to close the gap between training and generalization error
Tuning \(\lambda\) is more art than science for NNs
\(\lambda\) sorta controls the number of hidden units!
We’re shrinking most weights to 0
Equivalent to choosing the number of hidden units without needing to retrain a bunch
We could also induce a L1 penalty on the weights
This isn’t popular because SGD doesn’t play very well with L1 penalties
A more popular method of inducing sparse weight sets is to employ dropout
Dropout
With probability \(p\), set the weight between node \(i\) and node \(j\) to zero. Do this randomly for each instance in each iteration.
Dropout works well because it forces the NN to learn robust dependencies among the hidden values
If the connection isn’t always there, then the other connections need to make up for the missingness
Akin to an ensemble of networks that will do well on small subsets of the data
Think a random forest for NNs!
\[ \begin{equation*} \mathbf{W} = \begin{bmatrix} w_{11} & w_{12} & w_{13} & w_{14} & w_{15} & \ldots \\ w_{21} & w_{22} & w_{23} & w_{24} & w_{25} & \ldots \\ w_{31} & w_{32} & w_{33} & w_{34} & w_{35} & \ldots \\ w_{41} & w_{42} & w_{43} & w_{44} & w_{45} & \ldots\\ w_{51} & w_{52} & w_{53} & w_{54} & w_{55} & \ldots\\ \vdots & \vdots & \vdots & \vdots & \vdots & \ddots\\ \end{bmatrix} \end{equation*} \]
\[ \begin{equation*} \mathbf{W} = \begin{bmatrix} 0 & w_{12} & w_{13} & 0 & w_{15} & \ldots \\ w_{21} & 0 & w_{23} & w_{24} & 0 & \ldots \\ w_{31} & w_{32} & 0 & w_{34} & w_{35} & \ldots \\ 0 & w_{42} & w_{43} & 0 & w_{45} & \ldots\\ w_{51} & 0 & w_{53} & w_{54} & 0 & \ldots\\ \vdots & \vdots & \vdots & \vdots & \vdots & \ddots\\ \end{bmatrix} \end{equation*} \]
Dropout is sorta like fitting a noisy version of the NN weights
With probability \(p\) of dropout:
\[\theta_{dij} = w_{dij}\epsilon_{di}\]
where \(\epsilon_{di} \sim \text{Bernoulli}(1 - p)\).
After training a dropout network, we multiply the final weights by \(1 - p\) to get an approximation of the true weights for the network to use for predictions!
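The train-time masking and prediction-time rescaling above can be sketched as follows; the weight values and \(p = 0.25\) are illustrative.

```python
import random

def dropout_mask(weights, p, rng):
    # Training: zero each weight independently with probability p
    # (epsilon ~ Bernoulli(1 - p), redrawn every instance/iteration).
    return [w * (1.0 if rng.random() >= p else 0.0) for w in weights]

def prediction_weights(weights, p):
    # Prediction: scale the trained weights by 1 - p to approximate
    # the average over all the random sub-networks seen in training.
    return [w * (1.0 - p) for w in weights]

rng = random.Random(0)
train_w = dropout_mask([0.5, -1.2, 0.8, 2.0], p=0.25, rng=rng)
test_w = prediction_weights([0.5, -1.2, 0.8, 2.0], p=0.25)
```

(Many libraries instead use "inverted dropout," scaling by \(1/(1-p)\) during training so prediction needs no adjustment; the expected behavior is the same.)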
Though the “default” dropout is .5, I have found that setting the level to .2 or .25 seems to work really well for a lot of problems
My workflow (that doesn’t work for every problem):
Early stopping
Dropout if it’s a really picky network
Computationally quick and leads to good generalizable NNs
This is just an overview of generalization methods for NNs
DL and PML1/PML2 have a lot of other methods for making NNs generalize better!
This has been a pretty thorough overview of NN architecture
Next class, we’ll extend our toolkit to include images and methods that respect pixel location